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DISCUSSION: THE DANTZIG SELECTOR: STATISTICAL 
ESTIMATION WHEN p IS MUCH LARGER THAN n 

By Michael P. Friedlander and Michael A. Saunders 
University of British Columbia and Stanford University 

1. Computational considerations. When Lasso [11] was proposed, it was 
a computational challenge to solve the associated quadratic program 

Lasso(t) min±||y- X/3\\l s.t. ||/?||i<i 

given just a single parameter t. Two active-set methods were described in 
[11], with some concern about efficiency if p were large, where X is n x p. 
Later when basis pursuit de-noising (BPDN) was introduced [2], the inten- 
tion was to deal with p very large and to allow X to be a sparse matrix or 
a fast operator. A primal-dual interior method was used to solve the asso- 
ciated quadratic program, but it remained a challenge to deal with a single 
parameter. 

The authors' new Dantzig Selector (DS) also assumes a specific parameter. 
It is helpful to state the BPDN and DS models together: 

BPDN(A) minA||/3||i + i||r||2 s.t. r = y-Xf3, 

DS(A) min ||/3||i s.t. ||X T r||oo<A, r = y-X(3. 

f3,r 

For reference purposes we also state the corresponding dual problems: 
BPdual(A) min-y T r + i||r||2 s.t. ||X T r||oo < A, 



DSdual(A) min— y r + A||z|h s.t. \\X r||oo<A, r = Xz. 

r,z 

We congratulate the authors on justifying their Dantzig Selector on detailed 
statistical grounds while also investigating a primal-dual interior method 
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suitable for a sparse or fast-operator X and making codes available through 
^i-magic [1]. The attraction of a pure linear programming (LP) formulation 
is understandable. Our aim here is to help explore the prospects for both 
interior and simplex implementations of DS, and to compare with BPDN. 

The vectors r = y — X/3 and s = —X T r are used often below. 

We now know that the Homotopy [5, 7, 8] and LARS [6] algorithms can 
solve BPDN(A) for all A > 0, and their active-set continuation approaches 
are remarkably efficient if the computed (3 remains sufficiently sparse. Never- 
theless, most of our discussion involves a single A, and although Lasso came 
before basis pursuit, we refer mostly to the de-noising problem BPDN(A) 
because its A is directly comparable to the DS parameter. 

Note from BPdual(A) that an optimal basis-pursuit solution provides a 
feasible solution to DS(A). Both approaches constrain \\X 

f 1 1 oo — ^ while 

keeping \\(3\\i "small," but BPDN strikes a further balance by giving a 
slightly larger ||/3||i and a slightly smaller ||r||2. 

2. The DS implementation. The authors eliminate r from DS(A) and 
formulate their model as the LP problem 

(DS) min l T u s.t. -u < (3 < u, -Al < X T (y - X0) < Al, 

f3,u 

for which ^-magic's MATLAB primal-dual interior solver lldantzig_pd [9] 
is designed. The main work per iteration lies in solving a p x p symmetric 
system 

(2.1) HAf3 = r 3 , H = D 12 + X T (XD U X T )X, 

where D\2 and -D34 are positive definite diagonal matrices. This system is 
solved in lldantzig_pd using a dense or sparse factorization of H if X is 
explicit, or the conjugate-gradient method if X is an operator. 

To save work when n<p, the authors suggest reducing (2.1) to an n x n 
system that involves the matrix I + {XD^X T ){XD^2X T ). Unfortunately 
this loses symmetry (unnecessarily) and becomes increasingly hazardous as 
iterations proceed because D12 approaches singularity. It is hard to recom- 
mend this approach except perhaps for the early iterations. 

3. Test data. Following ^-magic's example, in MATLAB we generated 
data X, y depending on dimensions n,p,T as follows: 

rand (' state ', 0) ; °/« initialize generators 

randn( ' state ' , 0) ; 



q = randperm(p) ; 7, random +/-1 signal 

q = q(l:T); 

beta = zeros(p,l); 
beta(q) = sign(randn(T, 1) ) ; 
[X,R] = qr(randn(p,n) ,0) ; 
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X = X'; °/ n x p measurement matrix 

y = X*beta + . 005*randn(n, 1) ; °/ noisy observations 

Thus, X is dense with orthogonal rows (XX T = I) and j3 should have T 
components close to ±1. We used A = 3e-3 for all test cases. Times are cpu 
seconds on a 3.2 GHz Linux Intel Pentium 4 with 2 GB of memory. 



4. DS and BPDN with interior solvers. To compare with more general 
primal-dual interior solvers, we considered two formulations of the DS prob- 
lem and also the BPDN formulation in [3]: 



min 1 T (v + w) 



(DS1) 



s.t. [X T X -X T X I] 



min 1 {v + w) 



(DS2) 



s.t. 



X -X I 

X T I 



min Al {v + w) + \r r 



(DS3) 



s.t. \X -X I] 



X y, v,w>0, ||s||oo < A, 





" V ' 








w 




y 




r 









_ s _ 







V, W > 0, ||s||oo < A, 



y, v,w>0, 



Table 1 
Dense orthogonal X 





Sizes 




(DS) 
llmagic 


(DS1) 
Pdco 


(DS2) 




(BPDN) 




n 


P 


T 


Pdco 


Cplex 


Pdco 


Cplex 


Greedy 


120 


512 


20 


1.2 


2.7 


3.9 


1.7 


0.2 


0.5 


0.1 


240 


1024 


40 


6.9 


16.1 


24.5 


16.5 


1.0 


4.9 


0.2 


360 


1536 


60 


20.8 


48.9 


75.6 


58.1 


2.4 


15.3 


0.4 


480 


2048 


80 


46.9 


110.7 


171.3 


122.8 


5.0 


34.3 


1.0 


720 


3072 


120 


149.4 


349.7 


550.1 


391.6 


14.6 


109.6 


3.4 


960 


4096 


160 


349.1 


814.0 


1275.7 


855.4 


31.8 


245.3 


9.2 



Cpu time for 15 iterations of three primal-dual interior solvers and T iterations of a 
Homotopy/LARS-type greedy algorithm. 



4 



M. P. FRIEDLANDER AND M. A. SAUNDERS 



where = v — w and ||/3||i = 1 (v + w), and we expect few nonzero elements 
in v and w. We applied lldantzig_pd [9], Pdco [10] and the Cplex barrier 
LP/QP solver [4] to the relevant problem formulations (see Table 1). With 
X dense, all solvers use dense Cholesky factors of matrices of the form 
H = AD\A T + L>2j where A denotes the corresponding constraint matrix 
and D\,D2 are positive diagonal matrices that change each iteration. (We 
modified lldantzig_pd slightly to ensure that its H was recognized to be 
symmetric positive definite.) 

Table 1 shows computation times on increasingly large problems. The l\- 
magic solver is specialized to problem (DS) and operates with X T X only 
once, whereas Pdco must double-handle that matrix in (DS1) and has n + p 
general constraints to deal with in (DS2). Cplex barrier solves all (DS2) 
examples in times midway between those for the other two solvers. 

We see that the solution times are rather large for all DS formulations and 
solvers. In contrast, Pdco is quite efficient on the BPDN problems, primar- 
ily because there are only n general constraints. A minor specialization to 
avoid double-handling X would reduce times further. We expected the Cplex 
barrier QP solver to perform comparably on the BPDN examples (since its 
interior algorithm is similar to that in Pdco). In case unbounded variables 
were not handled well by Cplex's barrier implementation, we added bounds 
on r enforcing ||r||oo < ||y||2, but the times remained essentially the same. 

The greedy method listed in Table 1 is an experimental MATLAB active- 
set method intended for problem BPDN(A) with a specific A. Like Homotopy 
and LARS, it starts with (3 = and selects one parameter at a time — in this 
case, the one whose dual constraint is most violated. It required exactly T 
iterations on these examples, each involving multiplications with X and X T 
(to compute r and s) and a QR factorization of S, the columns of X chosen 
so far. 

If T were changed in each test case, the solution times for the interior 
methods would be essentially unaltered, but for the greedy method they 
would change in proportion to T. 

If X is sparse but X T X is not, interior solvers on (DS2) could potentially 
be more efficient than on (DS) or (DS1). However, in trying to generate 
random sparse examples we found that the expected T nonzero parameters 
were not correctly identified. The sparse X case remains for study. Both 
lldantzig_pd and Pdco allow X to be an operator, but we have not com- 
pared those options. 

Donoho and Tsaig [5] give related computational results for Homotopy, 
Pdco and simplex for the basis-pursuit case r = (another LP setting!), 
with n,p, T as large as 1600, 4000, 320 and dense X drawn from the Uniform 
Spherical Ensemble. Again the greedy Homotopy approach performs best. 
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5. DS and the simplex method. It seems clear that formulations (DS) 
and (DS1) are not well suited to general-purpose simplex codes for two 
reasons: the presence of a potentially dense X T X, and the large number of 
constraints (viz., p). 

For a time, we thought that formulation (DS2) might be ideal for large- 
scale simplex solvers such as in Cplex. This would be for a specific A and 
values of T up to a few hundred, or a few thousand if X were sparse. If the 
initial basis includes r and s (with nonbasic variables v = w = 0) , the initial 
primal and dual variables can be cheaply computed from 



(5.1) 



Note that the initial dual values f = s = are dual feasible, and r will 
remain in the basis throughout. We hoped that the dual simplex method 
would proceed in an essentially greedy fashion until T components of v or 
w replaced T components of s. The basis would remain almost triangular 
and therefore easy for a typical sparse LU factorization. If we partition 
X = [Z S] to match the current zero and nonzero parameters, the basis 
LU factors take the form 
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with LU 



-S T S, 



where S is the same as S with columns scaled by ±1 according to whether 
an element of v or w is basic. The work per iteration with such factors is 
much the same as for Homotopy/LARS: multiplications by X and X T and 
factorization of S T S. (A specialized basis factorization could account for the 
special structure of S T S and compute a QR factorization of S.) 

A specialized simplex solver could be constructed to use the same LU 
factorization even if X is an operator. Ideally, S would be kept in memory 
as its columns come and go. 

Further, we note that if all values of A are of interest, problem (DS2) 
may be treated as an LP problem with parametric bounds. A simplex-type 
algorithm for such problems is known [12] that works directly with the origi- 
nal variables and constraints. Thus a Homotopy/LARS-type algorithm does 
indeed seem practical at first sight. 

The most effective Cplex simplex options we could find were dual simplex, 
no scaling, no presolve and steepest-edge pricing. Results are summarized in 
Table 2. Unfortunately, it appears that simplex methods work in a "far 
from greedy" manner. In a genuinely optimal solution, many more than T 
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Table 2 
Dense orthogonal X 





Sizes 






tol = 


0.1 




tol = 0.01 
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= 0.001 




n 


P 
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\s\ 


Time 


itns 


\s\ 


Time 


itns 


\s\ 


Time 


120 


512 


20 


20 


20 


0.1 


20 


20 


0.1 


86 


63 


0.2 


240 


1024 


40 


58 


56 


0.4 


67 


57 


0.4 


405 


150 


2.3 


360 


1536 


60 


187 


134 


2.3 


655 


156 


7.8 


1231 


215 


15.1 


480 


2048 


80 


163 


122 


3.4 


549 


211 


11.1 


1277 


275 


26.7 


720 


3072 


120 


356 


223 


15.3 


1414 


317 


65.0 


3006 


420 


146.6 


960 


4096 


160 


965 


414 


80.2 


6226 


488 


574.9 


9229 


567 


891.6 



Cplex dual simplex on problem (DS2) with loose and tighter termination tolerances. 



Dual simplex, tol = 0.1 Dual simplex, tol - 0.001 
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Fig. 1. Cplex dual simplex method on 240 x 1024 problem (DS2) with T — 40 nonzero 
"true" parameter values o/±l. Plot of significant (top) and small (bottom) solution values 
with two termination tolerances. More small values imply more simplex iterations and more 
time per iteration. 
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parameters enter the basis, and the number of simplex iterations exceeds T 
by a huge factor. (Thus, many parameters must be getting selected and then 
rejected.) This has dampened our optimism for the effectiveness of simplex 
on large-scale DS problems. 

On the other hand, the degree of optimality required can have a profound 
effect. Table 2 shows the trend with several feasibility and optimality toler- 
ances (tol = 0.1, 0.01 and 0.001). We would normally regard tol = 0.1 as 
unusually "loose," but Figure 1 emphasizes the benefit of terminating early 
(at the risk of violating HX^YHoo < A by as much as toll). 

6. Conclusions. We have tested interior solvers on three DS formula- 
tions, and compared with three BPDN solvers on the same data. Table 1 
results confirm that the larger DS constraint matrix is likely to invoke a high 
computational cost compared to the Lasso/BPDN model. 

In keeping with the DS name, we have also tested some simplex codes 
(which seem necessary if a range of A values is of interest). Table 2 again 
predicts a high cost, except perhaps if low-accuracy solutions are acceptable. 

Tables 1 and 2 and Figure 1 can be reproduced using the MATLAB scripts 
in www . cs . ubc . ca/labs/ scl/ds_discussion . html. 

We emphasize that the solvers tested are general purpose. They would 
all be "happier" if the dense data X were sparse, and none of them takes 
advantage of the property XX T = I (which may arise in certain situations). 
We wish the authors much success in exploring the virtues of their linear 
DS model for an increasing range of real-world applications. 
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